Lab 06: Trees and Networks

Paul J. McMurdie

July 22, 2014

Outline of Lab 02, Part 2

Expected Time to Completion: 45 minutes

  • Trees (in R)
  • Tree-based distances
  • Network Manipulation
  • Network Graphics

Startup R session

Load phyloseq

The ape package provides lots of useful, efficient tools for dealing with trees in R. We load it right away.

library("phyloseq"); packageVersion("phyloseq")
## [1] '1.9.11'
library("ape"); packageVersion("ape")
## [1] '3.1.3'
library("ggplot2"); packageVersion("ggplot2")
## [1] '1.0.0'

Load data from previous lab

load("example-data.RData")
ls()
## [1] "closedps"  "metadata"  "ps0"       "qiimedata"

Trees (in R)

Trees

  • Components of “phylo” object
  • Rooting
  • Descendants
  • Pruning
  • Random Tree
  • Plotting

Trees - access

Parts of “phylo” object. It is actually a list

tree = phy_tree(closedps)
class(tree)
## [1] "phylo"
names(tree)
## [1] "edge"        "Nnode"       "tip.label"   "edge.length" "node.label"

Trees - access

The edge matrix

head(tree$edge)
##      [,1] [,2]
## [1,]   74    1
## [2,]   74   75
## [3,]   75    2
## [4,]   75   76
## [5,]   76   77
## [6,]   77   78

Trees - access

The edge matrix

tree$Nnode
## [1] 72
tree$tip.label[1:5]
## [1] "169901"  "673925"  "4470518" "1107945" "4346374"
taxa_names(tree)[1:5]
## [1] "169901"  "673925"  "4470518" "1107945" "4346374"
tree$edge.length[1:5]
## [1] 0.39755 0.08322 0.38007 0.02085 0.39352
tree$node.label[1:5]
## [1] "" "" "" "" ""

Trees - assignment

The node labels were empty. Let’s assign simulated bootstrap values to each node label using standard R list assignment semantics.

set.seed(711)
tree$node.label <- round(runif(tree$Nnode, 0 , 1), digits = 3)

Trees - assignment

All other aspects of a “phylo” object are just as easy to modify. This is both convenient and dangerous. Why?

Trees - assignment

What about assigning to a tree in a phyloseq object?

Easy! Just use the same list assignment semantics.

phy_tree(closedps)$node.label[1:5]
## [1] "" "" "" "" ""
phy_tree(closedps)$node.label <- tree$node.label
phy_tree(closedps)$node.label[1:5]
## [1] 0.290 0.426 0.325 0.630 0.445

Trees - assignment

You can even replace the entire tree.

Caution: this may side-step the automatic index checks, like OTU names, that are built-in to the phyloseq function.

phy_tree(closedps) <- tree
phy_tree(closedps)
## 
## Phylogenetic tree with 73 tips and 72 internal nodes.
## 
## Tip labels:
##  169901, 673925, 4470518, 1107945, 4346374, 4414420, ...
## Node labels:
##  0.29, 0.426, 0.325, 0.63, 0.445, 0.123, ...
## 
## Rooted; includes branch lengths.

Trees - Root

Let’s check which OTU is considered root in this tree.

is.rooted(tree)
## [1] TRUE
ROOT = tree$edge[(ntaxa(tree) + 1), 2]
# check
sum(tree$edge[, 2] == ROOT)
## [1] 1
sum(tree$edge[, 2] == ROOT+2L)
## [1] 1
rootAnc = tree$edge[(ntaxa(tree) + 1), 1]
tree$edge[(tree$edge[, 2] == rootAnc), ]
## [1] 110 111

Trees - Ape Plots

Add the node IDs on a “native R” ape tree plot

plot(tree); nodelabels()

plot of chunk unnamed-chunk-10

Trees - Ape Plots

Plot the pretend bootstrap values instead of the node ID.

plot(tree); nodelabels(tree$node.label)

plot of chunk unnamed-chunk-11

Trees - plot_tree

phyloseq defaults are a bit more legible.

plot_tree(tree, ladderize = "left", label.tips = "OTU")

plot of chunk unnamed-chunk-12

Trees - plot_tree

Can add symbols to simplify bootstrap values

plot_tree(tree, nodelabf = nodeplotboot(80), ladderize = "left", label.tips = "OTU")

plot of chunk unnamed-chunk-13

Trees - plot_tree

Can add symbols representing samples

plot_tree(closedps, nodelabf = nodeplotboot(80), ladderize = "left", label.tips = "OTU",
          color = "Treatment", justify = "left")

plot of chunk unnamed-chunk-14

Tree-based distances

Tree-based distances

  • UniFrac
  • DPCoA

Tree-based distances

Convert to relative abundance, then calculate distance.

cpsp = transform_sample_counts(closedps, function(x){x/sum(x)})
distance(cpsp, "unifrac", weighted=TRUE)
##        PC.354 PC.355 PC.356 PC.481 PC.593 PC.607 PC.634 PC.635
## PC.355 0.3611                                                 
## PC.356 0.2574 0.2234                                          
## PC.481 0.2305 0.2669 0.2411                                   
## PC.593 0.4754 0.3084 0.2929 0.3678                            
## PC.607 0.4955 0.3206 0.3527 0.3671 0.2079                     
## PC.634 0.6759 0.4067 0.5343 0.5692 0.4416 0.4429              
## PC.635 0.5547 0.3276 0.3980 0.4507 0.3519 0.2942 0.2615       
## PC.636 0.6344 0.3621 0.4803 0.5336 0.3839 0.3907 0.1602 0.2147

Tree-based distances - UniFrac

duf = distance(cpsp, "unifrac", weighted=TRUE)
dufPCoA = ordinate(closedps, "PCoA", duf)
plot_scree(dufPCoA)

plot of chunk unnamed-chunk-16

Tree-based distances - UniFrac

Make the w-UniFrac PCoA graphic

plot_ordination(cpsp, dufPCoA,
                color = "Treatment", title="w-UniFrac PCoA") + geom_point(size=7)

plot of chunk unnamed-chunk-17

Tree-based distances - UniFrac

plot_ordination(cpsp, dufPCoA, type = "split", 
                shape="Treatment", color="Phylum", title="wUF-MDS Split Plot")

plot of chunk unnamed-chunk-18

Tree-based distances - UniFrac

What about alternative ordination method, like NMDS?

How would you encode that?

Tree-based distances - DPCoA

dpcoa = ordinate(cpsp, "DPCoA")
plot_scree(dpcoa)

plot of chunk unnamed-chunk-19

Tree-based distances - DPCoA

plot_ordination(cpsp, dpcoa,
                color = "Treatment", title="DPCoA") + geom_point(size=7)

plot of chunk unnamed-chunk-20

Tree-based distances - DPCoA

plot_ordination(cpsp, dpcoa, type = "split", 
                shape="Treatment", color="Phylum", title="DPCoA Split Plot")

plot of chunk unnamed-chunk-21

Networks

Distance-based Networks

There are lots of types of networks.

Here we are discussing a very simple network that represents (dis)similarity values. Its main purpose in our case is for data exploration.

Network Manipulation

Network Manipulation

Lots on this subject. See CRAN task view(s)

phyloseq uses the igraph package for its internal network methods.

See plot_network tutorial.

Network Graphics

Network Graphics

There are two network plot functions in phyloseq:

  • plot_net
  • plot_network (legacy, igraph manipulation)

Network Graphics

plot_net

  • faster and easier to use than the original, plot_network
  • Better defaults, maps distance values to line width

Network Graphics - example

There is a random aspect to some of the network layout methods. For complete reproducibility of the images produced later in this tutorial, it is possible to set the random number generator seed explicitly:

set.seed(711L)

Network Graphics - example

Because we want to use the enterotype designations as a plot feature in these plots, we need to remove the 9 samples for which no enterotype designation was assigned (this will save us the hassle of some pesky warning messages, but everything still works; the offending samples are anyway omitted).

data("enterotype")
enterotype = subset_samples(enterotype, !is.na(Enterotype))

Network Graphics - example

The newer plot_net function does not require a separate make_network function call, or a separate igraph object. For examples running the older plot_network function, which may provide some added flexibility with igraph objects, see the plot_network section later.

Network Graphics - example

Try plot_net with the default settings.

plot_net(enterotype, maxdist = 0.4, point_label = "Sample_ID")

plot of chunk unnamed-chunk-24

Network Graphics - example

The previous graphic displayed some interesting structure, with one or two major subgraphs comprising a majority of samples. Furthermore, there seemed to be a correlation in the sample naming scheme and position within the network.

Network Graphics - example

Instead of trying to read all of the sample names to understand the pattern, let’s map some of the sample variables onto this graphic as color and shape:

plot_net(enterotype, maxdist = 0.3, color = "SeqTech", shape="Enterotype")

plot of chunk unnamed-chunk-25

Network Graphics - example

In the previous examples, the choice of maximum-distance and distance method were informed, but arbitrary.

  • What happens when maxdist value is decreased?? (hint: this will usually decrease the number of edges in the network).
  • What about other distances?
  • What can you learn from an ordination method instead?
  • What’s different about these two forms of exploratory graphics?